This case is about a bank (Thera Bank) whose management wants to explore ways of converting its liability customers into personal loan customers (while retaining them as depositors). A campaign that the bank ran last year for liability customers showed a healthy conversion rate of over 9%. This has encouraged the retail marketing department to devise campaigns with better target marketing, to increase the success ratio with minimal budget.
The classification goal is to predict the likelihood of a liability customer buying personal loans.
#import the necessary Libraries
import pandas as pd
import numpy as np
import seaborn as sns
from sklearn import preprocessing
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.feature_selection import RFE
from sklearn import metrics
from sklearn.metrics import classification_report, confusion_matrix
import matplotlib.pyplot as plt
#populate dataframe with file
bank_records= pd.read_csv("Bank_Personal_Loan_Modelling.csv")
#View top records
bank_records.head()
bank_records.shape
print(bank_records.columns)
# Get View Of All the columns
bank_records.info()
There are no null values in any column.
bank_records.dtypes
# Check if there are any null values
bank_records.isnull().any()
No null values found.
Salary: Salary can be one of the major predictor variables. Customers with high salaries may be less likely to need personal loans, while customers with medium or low salaries may be more likely to take them. Family size: the more earning family members there are, the lower the probability of buying a personal loan.
Age: Customers aged roughly 30–50 are the most likely to buy personal loans.
Education: whether the customer is a graduate or an undergraduate can affect buying probability; graduates and advanced professionals are more likely to take a personal loan from a bank than undergraduates.
Mortgage: a customer who already has a mortgage may be under debt and therefore more likely to take a personal loan.
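These hypotheses can be sanity-checked with a quick `groupby` once the data is loaded. A minimal sketch on a toy frame (the values are made up; only the column names are assumed to match the real dataset):

```python
import pandas as pd

# Hypothetical toy frame with the same column names as the bank data
toy = pd.DataFrame({
    "Income": [40, 180, 35, 150, 60, 200],
    "Education": [1, 3, 1, 2, 1, 3],
    "Personal Loan": [0, 1, 0, 1, 0, 1],
})

# Mean income and education level per loan outcome
print(toy.groupby("Personal Loan")[["Income", "Education"]].mean())
```

On the real `bank_records` frame, the same one-liner shows how buyers and non-buyers differ on each candidate feature.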
# Check if there are any negative values in the dataframe that could affect the analysis
bank_records.agg(lambda x: sum(x < 0)).sum()
There are 52 negative values throughout the dataframe; we need to check whether they appear in a column that should not contain negative values.
# The Experience column should not contain negative values, since experience cannot be negative
bank_records[bank_records['Experience']<0]['Experience'].count()
# Replace the negative values with NaN (only in the Experience column,
# not the whole row)
bank_records.loc[bank_records['Experience'] < 0, 'Experience'] = np.nan
# Validate that no negative values remain after the modification
bank_records[bank_records['Experience'] <0]['Experience'].count()
# Replace all NaN values with the median of the Experience column
bank_records['Experience'].fillna(bank_records['Experience'].median(),inplace=True)
# Plot the Data Distribution
sns.pairplot(data=bank_records,kind='kde')
# Install pandas-profiling, a tool for automated EDA
import sys
!{sys.executable} -m pip install pandas-profiling
# Performing univariate analysis with the help of pandas profiling
import pandas_profiling
bank_records.profile_report()
The dataset has 0 missing cells. It has 7 numeric variables: 'Age', 'CCAvg', 'ID', 'Income', 'Mortgage', 'ZIP Code', 'Experience'. It has 2 categorical variables: 'Education', 'Family'. It has 5 Boolean variables: 'CD Account', 'CreditCard', 'Online', 'Personal Loan', 'Securities Account'. Personal Loan is highly correlated with Income, average spending on credit cards, Mortgage, and whether the customer has a certificate of deposit (CD) account with the bank. Experience is also highly correlated with Age (ρ ≈ 0.994).
42% of the customers are graduates, while 30% are advanced professionals and 28% are undergraduates. Around 29% of customers have a family size of 1.
94% of the customers don't have a certificate of deposit (CD) account with the bank. Around 71% of the customers don't use a credit card issued by UniversalBank. Around 60% of customers use internet banking facilities. Around 90% of the customers did not accept the personal loan offered in the last campaign. Around 90% of the customers don't have a securities account with the bank.
The mean age of the customers is 45 with a standard deviation of 11.5; this matches the 30–50 range assumed in the hypothesis above. The age curve is very slightly negatively skewed (skewness = -0.029), so it is fairly symmetrical. The mean average spending on credit cards per month is 1.93 with a standard deviation of 1.75; the curve is highly positively skewed (skewness = 1.60). The mean annual income of the customers is 73.77 with a standard deviation of 46; the curve is moderately positively skewed (skewness = 0.84). The mean house mortgage value is 56.5 with a standard deviation of 101.71; the curve is highly positively skewed (skewness = 2.10) and there are many outliers present (kurtosis = 4.76).
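The skewness and kurtosis figures quoted above come straight out of pandas; a small sketch on a synthetic right-skewed column standing in for Mortgage:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Exponential draws mimic a right-skewed, heavy-tailed column like Mortgage
mortgage_like = pd.Series(rng.exponential(scale=100, size=5000))

print(mortgage_like.mean(), mortgage_like.std())
print(mortgage_like.skew())   # positive => right (positive) skew
print(mortgage_like.kurt())   # large excess kurtosis => heavy tail / outliers
```

On the real data, `bank_records['Mortgage'].skew()` and `.kurt()` reproduce the numbers reported above.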
# Drop the 'ID', 'ZIP Code' and 'Experience' columns for further analysis:
# 'ID' and 'ZIP Code' are just identifier series,
# and 'Experience' is highly correlated with 'Age'.
bank_records.drop('ID',axis=1 ,inplace=True)
bank_records.drop('ZIP Code',axis=1 ,inplace=True)
bank_records.drop('Experience',axis=1 ,inplace=True)
# Validating if all the columns are dropped
bank_records.columns
As we saw earlier in the univariate analysis, Mortgage contains outliers, so we must treat them, as the presence of outliers distorts the distribution of the data.
from scipy import stats
bank_records['Mortgage_Zscore']=np.abs(stats.zscore(bank_records['Mortgage']))
bank_records=bank_records[bank_records['Mortgage_Zscore']<3]
bank_records.drop('Mortgage_Zscore',axis=1,inplace=True)
bank_records.shape
# Personal Loan is the target variable, of Boolean type
# 0 = did not accept the loan in the last campaign (90.4%)
# 1 = accepted the loan in the last campaign (9.6%)
bank_records["Personal Loan"].value_counts()
# Plotting the count of customers who did and didn't take a loan
sns.countplot(x='Personal Loan', data=bank_records)
# Looking at the distribution of the various attributes in relation to the target
bank_records.groupby('Personal Loan').mean()
Observations: 1). The average income of customers who took a loan is more than double the average income of customers who didn't take one last year.
2). The average spending on credit cards per month ($000) is also more than double for the customers who took a loan.
3). The average mortgage of loan-availing customers is approximately double that of non-availing customers.
4). The average education level is lower for non-loan-takers.
As given in the data description, 9.6% of customers took a loan in the last campaign.
# Separating the dependent and independent variables and splitting in a 70:30 ratio
X= bank_records.drop('Personal Loan', axis=1)
y= bank_records['Personal Loan']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=1)
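With only ~9.6% positives, it can be worth passing `stratify=y` to `train_test_split` so the train and test sets keep the same class ratio. A sketch on synthetic labels (the real `X`/`y` come from `bank_records` above):

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)
X_demo = rng.normal(size=(1000, 3))
y_demo = (rng.random(1000) < 0.096).astype(int)  # ~9.6% positives

Xtr, Xte, ytr, yte = train_test_split(
    X_demo, y_demo, test_size=0.30, random_state=1, stratify=y_demo
)
# The positive rate is now (nearly) identical in both splits
print(ytr.mean(), yte.mean())
```

Without stratification, a random split can leave the small buyer class under-represented in either partition.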
# Creating Logistic Regression Model
logistic_reg_model= LogisticRegression(solver='liblinear', max_iter=1000)
# Fitting the x and y in logistic model
logistic_reg_model.fit(X_train, y_train)
# Predicting based on the logistic regression model
y_pred_logistic = logistic_reg_model.predict(X_test)
print('Accuracy of logistic regression classifier on training data set: {:.2f}'.format(logistic_reg_model.score(X_train, y_train)))
print('Accuracy of logistic regression classifier on test set: {:.2f}'.format(logistic_reg_model.score(X_test, y_test)))
Even though the accuracy is very high, the percentage of buyers relative to non-buyers is very small, so accuracy alone doesn't tell us how well the model performed. Let's look at the other performance metrics.
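To see why accuracy is misleading here: with ~90% non-buyers, a classifier that always predicts "no loan" already scores about 0.90 accuracy while finding zero buyers. A sketch with scikit-learn's `DummyClassifier` on synthetic labels at the same ratio:

```python
import numpy as np
from sklearn.dummy import DummyClassifier

rng = np.random.default_rng(0)
y_demo = (rng.random(1500) < 0.096).astype(int)  # ~9.6% buyers
X_demo = np.zeros((1500, 1))  # features are irrelevant to the dummy

# Majority-class baseline: always predict the most frequent label (0)
baseline = DummyClassifier(strategy="most_frequent").fit(X_demo, y_demo)
print(baseline.score(X_demo, y_demo))  # high accuracy, yet recall is 0
```

Any real model must be judged against this baseline on recall and AUC, not just accuracy.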
# Printing the Performance metrics
print(classification_report(y_test,y_pred_logistic))
# Printing the accuracy score
from sklearn.metrics import accuracy_score
print(accuracy_score(y_test,y_pred_logistic))
#printing the Confusion matrix
print(confusion_matrix(y_test,y_pred_logistic))
from sklearn.metrics import roc_auc_score
from sklearn.metrics import roc_curve
logistic_r_probability=logistic_reg_model.predict_proba(X_test)
fpr_l,tpr_l,thr_l=roc_curve(y_test,logistic_r_probability[:,1])
roc_l = metrics.auc(fpr_l,tpr_l)
plt.figure()
plt.plot(fpr_l, tpr_l, label='Logistic Regression (area = %0.2f)' % roc_l)
plt.plot([0, 1], [0, 1],'r--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curve for Logistic Regression')
plt.legend(loc="lower right")
plt.savefig('Log_ROC')
plt.show()
print('Area under ROC curve %f'%roc_l)
As mentioned before, accuracy alone can't tell us how well the model predicted, so we will try to improve the recall by scaling the features and then compare the performance of several models.
from sklearn import preprocessing
scaler = preprocessing.StandardScaler()
scaled_X_train = scaler.fit_transform(X_train)
# Reuse the scaler fitted on the training data; never refit on the test set
scaled_X_test = scaler.transform(X_test)
log_r_model= LogisticRegression(solver='liblinear')
log_r_model.fit(scaled_X_train,y_train)
y_pred = log_r_model.predict(scaled_X_test)
print(classification_report(y_test,y_pred))
print(accuracy_score(y_test,y_pred))
print(confusion_matrix(y_test,y_pred))
logistic_probability=log_r_model.predict_proba(scaled_X_test)
fpr2,tpr2,thr2=metrics.roc_curve(y_test,logistic_probability[:,1])
roc_2 = metrics.auc(fpr2,tpr2)
plt.figure()
plt.plot(fpr2, tpr2, label='Logistic Regression (area = %0.2f)' % roc_2)
plt.plot([0, 1], [0, 1],'r--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curve for Logistic Regression')
plt.legend(loc="lower right")
plt.savefig('Log_ROC')
plt.show()
print('Area under ROC curve %f'%roc_2)
We get a recall value of 59%, which means the model did much better at predicting true positives. The area under the curve is also around 95%.
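Recall can often be traded for precision by lowering the default 0.5 decision threshold applied to `predict_proba`. A sketch on synthetic imbalanced data (not the bank dataset); on the real model, the same idea applies to `logistic_probability[:, 1]`:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split

# Synthetic ~90/10 imbalanced problem standing in for the loan data
X, y = make_classification(n_samples=2000, weights=[0.9], random_state=1)
Xtr, Xte, ytr, yte = train_test_split(X, y, test_size=0.3, random_state=1)

model = LogisticRegression(max_iter=1000).fit(Xtr, ytr)
proba = model.predict_proba(Xte)[:, 1]

recalls = {}
for threshold in (0.5, 0.3):
    preds = (proba >= threshold).astype(int)
    recalls[threshold] = recall_score(yte, preds)
    print(threshold, round(recalls[threshold], 3))
```

Lowering the threshold flags more customers as likely buyers, raising recall at the cost of more false positives; the right trade-off depends on the campaign budget.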
# Using the KNN Classification Algorithm
from sklearn.neighbors import KNeighborsClassifier
knn_model = KNeighborsClassifier(n_neighbors=5)
knn_model.fit(scaled_X_train,y_train)
y_pred_knn=knn_model.predict(scaled_X_test)
print('Accuracy of KNN classifier on training data set: {:.2f}'.format(knn_model.score(scaled_X_train, y_train)))
print('Accuracy of KNN classifier on test set: {:.2f}'.format(knn_model.score(scaled_X_test, y_test)))
# Printing the Performance Metrics
print(classification_report(y_test,y_pred_knn))
print(accuracy_score(y_test,y_pred_knn))
print(confusion_matrix(y_test,y_pred_knn))
knn_probability=knn_model.predict_proba(scaled_X_test)
fpr_knn,tpr_knn,thr_knn=metrics.roc_curve(y_test,knn_probability[:,1])
roc_knn = metrics.auc(fpr_knn,tpr_knn)
plt.figure()
plt.plot(fpr_knn, tpr_knn, label='KNN (area = %0.2f)' % roc_knn)
plt.plot([0, 1], [0, 1],'r--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curve for KNN')
plt.legend(loc="lower right")
plt.savefig('KNN_ROC')
plt.show()
print('Area under ROC curve %f'%roc_knn)
from sklearn.naive_bayes import GaussianNB
naive_model = GaussianNB()
naive_model.fit(scaled_X_train,y_train)
y_pred_naive=naive_model.predict(scaled_X_test)
print('Accuracy of Naive Bayes classifier on training data set: {:.2f}'.format(naive_model.score(scaled_X_train, y_train)))
print('Accuracy of Naive Bayes on test set: {:.2f}'.format(naive_model.score(scaled_X_test, y_test)))
# Printing the Performance Metrics
print(classification_report(y_test,y_pred_naive))
print(accuracy_score(y_test,y_pred_naive))
print(confusion_matrix(y_test,y_pred_naive))
naive_probability=naive_model.predict_proba(scaled_X_test)
fpr_naive,tpr_naive,thr_naive=metrics.roc_curve(y_test,naive_probability[:,1])
roc_naive = metrics.auc(fpr_naive,tpr_naive)
plt.figure()
plt.plot(fpr_naive, tpr_naive, label='Naive Bayes (area = %0.2f)' % roc_naive)
plt.plot([0, 1], [0, 1],'r--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curve for Naive Bayes')
plt.legend(loc="lower right")
plt.savefig('NB_ROC')
plt.show()
print('Area under ROC curve %f'%roc_naive)
print(confusion_matrix(y_test,y_pred_logistic))
print(confusion_matrix(y_test,y_pred_knn))
print(confusion_matrix(y_test,y_pred_naive))
KNN: The accuracy of the KNN classifier at predicting whether a person will buy or not is 97%, higher than Logistic Regression and Naive Bayes; its recall of 66% is also better than Logistic Regression and Naive Bayes, and its area under the curve is fairly good.
For these reasons, KNN is the best of the three models (KNN, Logistic Regression, and Naive Bayes).
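The comparison above can be made in a single loop; a sketch on synthetic data with a similar class imbalance (not the bank dataset), reporting accuracy, recall, and ROC AUC for all three model families:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, recall_score, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

# Synthetic ~90/10 imbalanced problem standing in for the loan data
X, y = make_classification(n_samples=2000, weights=[0.9], random_state=1)
Xtr, Xte, ytr, yte = train_test_split(X, y, test_size=0.3, random_state=1)

scaler = StandardScaler().fit(Xtr)
Xtr_s, Xte_s = scaler.transform(Xtr), scaler.transform(Xte)

models = {
    "Logistic Regression": LogisticRegression(solver="liblinear"),
    "KNN": KNeighborsClassifier(n_neighbors=5),
    "Naive Bayes": GaussianNB(),
}
results = {}
for name, model in models.items():
    model.fit(Xtr_s, ytr)
    pred = model.predict(Xte_s)
    proba = model.predict_proba(Xte_s)[:, 1]
    results[name] = roc_auc_score(yte, proba)
    print(name, accuracy_score(yte, pred),
          recall_score(yte, pred), results[name])
```

Running the same loop on `scaled_X_train`/`scaled_X_test` from the notebook reproduces the comparison that led to choosing KNN.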